Building Fault Survivable MPI Programs with FT-MPI Using Diskless Checkpointing

Authors

  • Zizhong Chen
  • Graham E. Fagg
  • Edgar Gabriel
  • Julien Langou
  • Thara Angskun
  • George Bosilca
  • Jack Dongarra


Related resources

Disaster Survival Guide in Petascale Computing: An Algorithmic Approach

Jack J. Dongarra, Zizhong Chen, George Bosilca, and Julien Langou. Contents: 1.1 FT-MPI: A fault tolerant MPI implementation; 1.1.1 FT-MPI Overview; 1.1.2 FT-MPI: A Fault Tolerant MPI Implementation; 1.1.3 FT-MPI Usage; ...


MPI/FT: Architecture and Taxonomies for Fault-Tolerant, Message-Passing Middleware for Performance-Portable Parallel Computing

MPI has proven effective for parallel applications in situations with neither QoS nor fault handling. Emerging environments motivate fault-tolerant MPI middleware. Environments include space-based, wide-area/web/meta computing, and scalable clusters. MPI/FT, the system described here, trades off sufficient MPI fault coverage against acceptable parallel performance, based on mission requirem...


In-Memory Checkpointing for MPI Programs by XOR-Based Double-Erasure Codes

Today, high-performance computing (HPC) systems are larger than ever, which makes fault tolerance a challenge. MPI (Message Passing Interface) is one of the most important programming tools for HPC. There are quite a few fault-tolerant extensions for MPI, such as MPICH-V, StarFish, and FT-MPI. Most of them are based on on-disk checkpointing. In this pape...


MPI/FT™: Architecture and Taxonomies for Fault-Tolerant, Message-Passing Middleware for Performance-Portable Parallel Computing

MPI has proven effective for parallel applications in situations with neither QoS nor fault handling. Emerging environments motivate fault-tolerant MPI middleware. Environments include space-based, wide-area/web/meta computing, and scalable clusters. MPI/FT, the system described here, trades off sufficient MPI fault coverage against acceptable parallel performance, based on mission requirements...


Fault Tolerance in MPI Programs

This paper examines the topic of writing fault-tolerant MPI applications. We discuss the meaning of fault tolerance in general and what the MPI Standard has to say about it. We survey several approaches to this problem, namely checkpointing, restructuring a class of standard MPI programs, modifying MPI semantics, and extending the MPI specification. We conclude that within certain constraints, ...



Publication date: 2005